Data Valuation with Gradient Similarity

High-quality data is crucial for accurate machine learning and actionable analytics; however, mislabeled or noisy data is a common problem in many domains. Distinguishing low- from high-quality data can be challenging, often requiring expert knowledge and considerable manual intervention. Data valuation algorithms are a class of methods that seek to quantify the value of each sample in a dataset based on its contribution or importance to a given predictive task. These data values have shown an impressive ability to identify mislabeled observations, and filtering low-value data can boost machine learning performance. In this work, we present a simple alternative to existing methods, termed Data Valuation with Gradient Similarity (DVGS). This approach can be easily applied to any gradient descent learning algorithm, scales well to large datasets, and performs comparably to or better than baseline valuation methods for tasks such as corrupted label discovery and noise quantification. We evaluate the DVGS method on tabular, image, and RNA expression datasets to show the effectiveness of the method across domains. Our approach can rapidly and accurately identify low-quality data, which can reduce the need for expert knowledge and manual intervention in data cleaning tasks.


Introduction
Modern research and "big data" have led to remarkable discoveries and spurred many fields toward high-throughput data collection to capitalize on emerging methods in data science, machine learning, and artificial intelligence. Scientists involved in data collection go to great efforts to generate accurate and reproducible data; however, unavoidable measurement noise, batch effects, and natural stochasticity often lead to varying levels of data quality. Many foundational high-throughput datasets are affected by reproducibility and data quality issues, which often limit the actionable results of these resources [1,2,3,4].

Data Valuation
Data quality relates to the capacity of data to represent the underlying process. For example, the objective of photography is to gather information about a three-dimensional scene, while the purpose of measuring temperature is to reflect the kinetic energy of an object. Data quality issues can arise from many sources; for instance, chromatic aberration or lens imperfections in photography can distort images, creating inaccurate representations of a scene. Similarly, a miscalibrated thermometer might not measure temperature correctly. Data quality issues can be particularly problematic in machine learning [5,6,7], as a small subset of inaccurate samples can significantly degrade modeling performance even if the majority of samples are high-quality. Curating high-quality datasets can be challenging and usually requires expert knowledge of both the data generation process and the underlying process being measured. A more automated approach to quantifying data quality is a class of algorithms called data valuation, which assigns a numerical value to each sample in a dataset that characterizes its usefulness toward a predictive task. In the right context, data valuation can effectively capture many aspects of data quality. While there are a number of published data valuation algorithms, many of them follow a similar overarching approach, in which the user must define:
• Source dataset: The samples that will be valued. Note that this is sometimes called the training dataset (footnote 1).
• Target dataset: This dataset characterizes the task or goal of the data valuation, and choosing a different target dataset is likely to result in different data values. Note that this is sometimes called the validation dataset (footnote 2).
• Learning algorithm: The choice of predictive model, e.g., logistic regression, random forest, neural network, etc.
• Performance metric: The evaluation metric used to compare the learning algorithm's predictions against the ground truth, e.g., accuracy or area under the receiver operating characteristic curve (for classification), mean squared error or R^2 (for regression), etc.
Provided these four user-defined elements, a data valuation algorithm then assigns a numerical value to each sample in the source dataset that quantifies the importance of a sample, or its contribution to the predictive performance of the learning algorithm as evaluated on the target dataset. This method can be used in a number of ways, such as:
• Model Enhancement: To improve the predictive performance of a model by filtering low-quality data or identifying mislabeled samples.
• Attribution: To quantify data value for monetary recompense or to quantify fair contribution, i.e., credit.
• Domain Adaptation: To identify samples from an alternative domain that are relevant to a target task.
• Efficiency: Reduce the compute resources (runtime or memory) required to train machine learning models.
Existing methods for data valuation include Leave-One-Out (LOO) [10], Data Shapley [8], and Data Valuation using Reinforcement Learning (DVRL) [9]. Under some conditions, DVRL has been shown to outperform both Data Shapley and LOO and has been applied to large datasets (more than 500k samples). In noisy or corrupted datasets, these methods can be used to significantly improve machine learning prediction performance by filtering samples with low data values prior to model training. Additionally, data values were shown to effectively quantify data quality aspects such as the amount of noise in an image or incorrect class labels [8] (i.e., low values correlate with high-noise or mislabeled observations). As a demonstration of these methods, a recent paper used Data Shapley to value an X-ray image dataset for the prediction of pneumonia. By removing approximately 20% of their training data with the lowest data values, the authors were able to improve the test set prediction accuracy by more than 15%. Furthermore, when the authors inspected a subset of images with the lowest data values, they found it significantly enriched for mislabeled images [11].
Footnote 1: We use this naming convention to avoid confusion later, since DVGS updates model parameters based on gradients from the "Target Dataset" rather than the "Source Dataset." Data Shapley [8] and Data Valuation with Reinforcement Learning (DVRL) [9] would refer to this as the "Training" dataset.
Footnote 2: Data Shapley [8] and Data Valuation with Reinforcement Learning (DVRL) [9] would refer to this as the "Validation" dataset.
A key aspect of Data Shapley is the definition of equitable data conditions [8], which we summarize as:
• Nullity: If a sample does not affect model performance, it should have a value of zero.
• Equivalency: Two samples with equal contribution should have equal values.
• Additivity: The sum of the samples' data values should be equal to the data value of the grouped samples.
While these conditions are convenient descriptors of data in many settings, they are not required for most of the pragmatic tasks of data valuation. Furthermore, Data Shapley is the only data valuation method to our knowledge with theoretical justifications fulfilling these conditions. Other methods, such as DVRL, perform comparably or better in many data valuation applications, such as corrupted label identification [9].

Library of Integrated Network-Based Cellular Signatures
There are few, if any, datasets devoid of data quality issues, and addressing these challenges can improve the results of downstream analytics. A foundational dataset that has been highly impactful in modern research, especially in the cancer and drug-development domain, is the Library of Integrated Network-Based Cellular Signatures (LINCS) project. The LINCS program has generated high-dimensional transcriptomic profiles (L1000 assay; 978 landmark genes) characterizing the effect of chemical and genetic perturbations across a range of cellular contexts, time points, and dosages [12]. This data has been used successfully in many applications; however, a continued challenge with high-throughput data pipelines is the identification of low-quality samples. In 2016, a systematic quality control analysis of LINCS L1000 data showed that differentially expressed genes (DEGs) inferred from the L1000 platform were often unreliable. For example, only 30% of DEGs overlapped between any two selected control viral vectors in short-hairpin RNA (shRNA) perturbations [4]. To address these issues, many researchers have proposed methods to improve the L1000 data analysis pipeline, including alternative approaches to peak deconvolution [13,14] and a novel method of aggregating bio-replicates in order to improve the signal-to-noise ratio [15,16].
A recent paper, which sought to use the LINCS L1000 dataset for the repurposing of COVID-19 drugs, proposed a simple but effective method of quantifying sample-level data quality by computing the average Pearson correlation (APC) between the replicates of a perturbation. Intuitively, if replicates are discordant, and therefore have low or negative pairwise correlations, then the resulting APC value is low; however, if the replicates are concordant and have high pairwise correlations, then the APC value is high. The authors went on to show that filtering L1000 data based on APC values could significantly improve the predictive accuracy of machine learning models [17].
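The intuition behind APC can be illustrated with a short sketch. It assumes that APC is the mean Pearson correlation over all unordered pairs of replicate profiles; the exact procedure used in [17] may differ (see Supplementary 5.2), and the function names and toy profiles below are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(a, b):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def average_pearson_correlation(replicates):
    """Mean Pearson correlation over all unordered pairs of replicates.

    Assumption: this pairwise-mean definition is an illustration of the
    APC idea, not necessarily the exact computation used for LINCS.
    """
    pairs = list(combinations(replicates, 2))
    if not pairs:
        raise ValueError("APC needs at least two replicates")
    return sum(pearson(a, b) for a, b in pairs) / len(pairs)


# Toy expression profiles (4 genes, 3 replicates each); values are made up.
concordant = [[1.0, 2.0, 3.0, 4.0], [1.1, 2.1, 2.9, 4.2], [0.9, 1.8, 3.2, 3.9]]
discordant = [[1.0, 2.0, 3.0, 4.0], [4.0, 1.0, 3.5, 0.5], [2.0, 4.0, 1.0, 3.0]]

apc_concordant = average_pearson_correlation(concordant)
apc_discordant = average_pearson_correlation(discordant)
print(f"concordant APC: {apc_concordant:.3f}")
print(f"discordant APC: {apc_discordant:.3f}")
```

In this toy case the concordant replicates yield an APC close to 1, while the discordant replicates yield a negative APC, matching the intuition described above.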
Improvement of data quality in large publicly available datasets, such as the LINCS project, has the potential to markedly improve the usefulness and impact of these datasets. In addition, effective data quality metrics could be used to inform the selection of new conditions that will be most beneficial to select prediction tasks or to avoid conditions that are unlikely to be useful.

Related Work
Dataset distillation is a related field, which attempts to distill knowledge from a large dataset into a small one by synthesizing a new dataset that is representative of the original dataset but much smaller [18,19]. Adjacent to this domain is core-set or instance selection, which focuses on selecting a subset of a dataset that leads to comparable or better machine learning performance. In many pragmatic applications, data valuation can be seen as a core-set or instance selection method; for instance, data valuation produces a ranked list of the samples in a given dataset, based on their value or usefulness towards a predictive task. A ranked list of observations can easily be treated as an instance selection problem by choice of a threshold. Selection of a data value threshold, either by post-hoc analysis or manual choice, reframes data valuation methods as an instance selection approach. Additionally, many of the evaluation techniques of common data valuation methods are analogous to instance selection (e.g., machine learning performance improvement goals). There is no analog for the equitable data value conditions described by Ghorbani et al. [8] in core-set or instance selection. Several notable methods of core-set or instance selection include herding [20,21], distribution-matching [22,23], and incremental-gradient matching approaches [24]. There have also been instance selection approaches for large language models, which require large amounts of data to train, and for which the choice of prompting can have drastic impacts on model performance [25,26].
Anomaly detection or outlier detection attempts to separate data instances that deviate from the majority of samples [27]. Data valuation, especially when used to identify corrupted labels or characterize exogenous feature noise, can be examined through the lens of anomaly detection. For instance, the DVRL Estimator model tries to learn a joint probability distribution of exogenous and endogenous features that maximizes predictive performance of a given learning algorithm. If we make the assumption that identifying in-distribution training data will lead to test performance generalization, then DVRL can be thought of as a method for separating anomalous (out-of-distribution) from normal (in-distribution) samples. There have been countless methods introduced for anomaly detection; however, of particular relevance to this paper is a gradient-based anomaly representation for autoencoders proposed by Kwon et al., which defines an anomaly score based on both reconstruction error and the gradient [28].
There has also been significant research on how to train machine learning models in the presence of noisy or corrupted data. These methods range broadly and include meta learning, sample re-weighting schemes [29,30], noise-robust loss functions [31], and loss correction algorithms [32]. These methods predominantly focus on training high-performing models without explicitly removing corrupted or spurious observations; however, several of these methods use re-weighting schemes that rely on interim observation-specific weights and could be considered analogous to data values.

Contributions
Data valuation is an efficient and automated approach to characterizing sample informativeness, particularly in data cleaning tasks such as identifying incorrectly labeled or noisy samples. Existing data valuation methods, however, have limitations that hinder widespread application. Data Shapley does not scale well to large datasets and underperforms DVRL in certain tasks, such as corrupted label identification. DVRL often exhibits high performance in data valuation applications, but is sensitive to hyperparameters, choice of dataset, and predictive model. Tuning the DVRL hyperparameters can be inconvenient and time-consuming, and DVRL is ineffective in some predictive tasks. Furthermore, while DVRL is significantly faster than Data Shapley, it still requires sequential training of models to accurately estimate data values, which consumes significant computational resources.
In this paper, we introduce a novel data valuation method and compare it against baselines in two key tasks: 1) identifying corrupted labels and 2) identifying samples with high exogenous feature noise. We also explore the application of data valuation in unsupervised learning settings, which, to our knowledge, has not been evaluated previously. Unsupervised data valuation is ideal for quantifying sample noise in biological data types such as 'omics sequencing data (RNA expression, DNA mutation, methylation, etc.). Finally, we apply our method to compute data values for the LINCS L1000 level 5 dataset, which contains more than 700,000 high-dimensional samples. Our method demonstrates performance comparable to that of Data Shapley [8] and DVRL [9] while being significantly more computationally efficient. The speed and scalability of our method make it applicable to large datasets, even with small compute budgets. Moreover, our method is robust to hyperparameters, making it user-friendly.
Although data quality metrics have been proposed for the LINCS L1000 dataset, such as the average Pearson correlation (APC) between replicates [17], our data valuation results offer an alternative data quality metric. We show that filtering data based on our data values results in equivalent or higher-performing models than data filtering based on APC. Additionally, we show that our method is more effective in capturing high-valued samples than the APC metric, which could be used to inform future data acquisition decisions.
2 Proposed Methods

Data Valuation with Gradient Similarity
We propose a method of Data Valuation with Gradient Similarity (DVGS), based on the premise that source samples with a loss surface similar to the target loss surface will be more useful to a shared predictive task than source samples with dissimilar loss surfaces. For instance, a training dataset loss surface with a similar shape and minima to the validation dataset loss surface is likely to positively contribute to the validation predictive task. This premise is visualized by a toy example in Figure 1. Analytically computing the loss criterion for all possible parameter values (i.e., the full loss surface) is intractable for most problems, and therefore a comprehensive comparison of loss surfaces is challenging. However, we can approximate the comparison of loss surfaces by comparing gradient similarities at select parameter values. Comparison of gradients is also advantageous as it factors out the absolute loss value.
Similarly to other data valuation methods, DVGS requires a target dataset that characterizes the desired predictive task. The target dataset may be a high-quality dataset, drawn from a specific prediction domain, or a randomly sampled holdout set. Additionally, the user must define a differentiable predictive model that can be trained using stochastic gradient descent (SGD). The source dataset serves as the input on which data valuation will occur, with the goal of characterizing useful or detrimental samples. To perform DVGS, we optimize model parameters using SGD on the target dataset and at each iteration compute the similarity of the target batch gradient to each source sample gradient. We posit that this approach will accurately estimate data values if the gradient similarities are measured in critical regions of the weight-space, such as regions commonly explored during optimization. This procedure is documented in Algorithm 1. We do not expect or claim that this approach satisfies the equitable data value conditions proposed by Ghorbani et al.; however, we empirically demonstrate that this approach effectively characterizes data quality in many real-world prediction tasks while being simple, scalable, and easily extensible to a wide range of model architectures and predictive tasks.
Calculating the similarity between the gradients of the source samples and the target dataset requires a function that takes as input two high-dimensional gradient vectors and returns a single scalar characterizing similarity. Theoretically, any similarity or distance metric is applicable here; however, we chose to use cosine similarity because it produces easily interpreted values in [-1, 1] and neglects vector magnitude. We were concerned that gradient magnitudes may vary between early- and late-stage training, and to avoid biasing data values by large gradient magnitudes, we reasoned that gradient magnitude should be ignored.
Algorithm 1: Data Valuation with Gradient Similarity (DVGS); steps that are missing from this text are marked [...]
1-2: [...]
3: for j = 0, 1, ..., R do
4:   x_j, y_j ∼ B_i
5:   ŷ_j ← f_θ(x_j) ▷ predict outcome for target batch
6: end for
7: for k = 0, 1, ..., N_source do
8:   [...]
9:   x_k, y_k ∼ D_s
10:   [...] ▷ predict outcome for source sample
11:   [...] ▷ compute the gradient for the source sample
12:   [...] ▷ compute similarity of source sample gradient to the target batch gradient
Intuitively, different weight initializations are likely to produce different data values, especially if the target set has a complex multimodal loss surface. To reduce variance in DVGS data values due to weight initialization or stochastic mini-batch sampling, we add the option to run the DVGS algorithm multiple times, each with unique weight initialization and randomization seeds. Using this approach enables DVGS to explore multiple minima and compute similarity values on a wider range of parameter values. To aggregate a final data value, gradient similarities are averaged across all iterations and runs.
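The procedure described above can be sketched in a few lines of dependency-free Python. This is a minimal illustration under stated assumptions, not the original implementation: it assumes a logistic-regression model without a bias term, plain SGD on target mini-batches, and a synthetic two-feature dataset. The function names (`dvgs`, `grad`, `cosine`, `make_data`) and all hyperparameter values are hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def grad(theta, x, y):
    """Gradient of the logistic loss for one sample (x, y) w.r.t. theta."""
    p = sigmoid(sum(t * xi for t, xi in zip(theta, x)))
    return [(p - y) * xi for xi in x]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def dvgs(source, target, n_iters=50, batch_size=16, lr=0.5, n_runs=3, seed=0):
    """Average cosine similarity between each source-sample gradient and the
    target mini-batch gradient, over all SGD iterations and all runs."""
    rng = random.Random(seed)
    dim = len(source[0][0])
    totals = [0.0] * len(source)
    for _ in range(n_runs):  # each run uses a fresh weight initialization
        theta = [rng.gauss(0.0, 0.1) for _ in range(dim)]
        for _ in range(n_iters):
            batch = rng.sample(target, batch_size)
            grads = [grad(theta, x, y) for x, y in batch]
            g_target = [sum(col) / len(batch) for col in zip(*grads)]
            # Compare every source-sample gradient to the target gradient.
            for k, (x, y) in enumerate(source):
                totals[k] += cosine(grad(theta, x, y), g_target)
            # The SGD step uses the target gradient only.
            theta = [t - lr * g for t, g in zip(theta, g_target)]
    return [t / (n_runs * n_iters) for t in totals]


def make_data(n, rng):
    """Synthetic task: label is 1 when x0 + x1 > 0, else 0."""
    data = []
    for _ in range(n):
        x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        data.append((x, 1.0 if x[0] + x[1] > 0 else 0.0))
    return data


data_rng = random.Random(1)
target = make_data(40, data_rng)
source = make_data(60, data_rng)
n_corrupt = 12  # flip the labels of the first 12 source samples
source = [(x, 1.0 - y) if k < n_corrupt else (x, y)
          for k, (x, y) in enumerate(source)]

values = dvgs(source, target)
mean_corrupt = sum(values[:n_corrupt]) / n_corrupt
mean_clean = sum(values[n_corrupt:]) / (len(values) - n_corrupt)
print(f"mean value, corrupted labels: {mean_corrupt:.3f}")
print(f"mean value, clean labels:     {mean_clean:.3f}")
```

In this toy setting a flipped label reverses the sign of the per-sample gradient, so corrupted samples tend to receive negative data values while clean samples receive positive ones. For a neural network, the same loop would use per-sample gradients from an automatic differentiation library instead of the closed-form gradient used here.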

Time Complexity
In most applications, it is reasonable to assume that the target dataset is much smaller than the source dataset, and therefore most of the runtime is spent computing the source gradients. This can be partially mitigated by only computing gradient similarities every T iterations or by pretraining the model. We estimate the computational complexity in big O notation (see Supplementary Note 5.3 for an experimental evaluation of time complexity): we expect that the DVGS method will scale linearly with the number of source samples and training iterations. A particular advantage of the DVGS method is that only a single model needs to be trained, whereas Data Shapley and DVRL require training many models sequentially. This time complexity makes it suitable for application to large datasets. Additionally, DVGS can be run in parallel and the results averaged to compute more accurate data values; such an ensemble approach is ideal for large datasets and complex loss surfaces. In many tasks, such as image classification with convolutional neural networks, it can be advantageous to pretrain the convolutional layers prior to performing DVGS.

Data
In this paper, we apply our data valuation algorithm to four datasets under various conditions.
• The ADULT dataset, also known as the "census income" dataset, consists of 14 categorical or integer features representative of an adult individual and is labeled based on whether they make more than 50k dollars per year [34].
• The BLOG dataset consists of internet blog characteristics parsed from the raw HTML file, and the output is the average number of comments received; we then binarize the endogenous variable with a threshold of 0 [35].
• The CIFAR10 dataset consists of tiny images labeled as one of 10 possible objects [36]; we transform the images into informative feature representations using a pre-trained InceptionNet prior to data valuation [37].
• The LINCS L1000 dataset measures RNA expression in cell lines some time after a chemical or genetic perturbation [12]. We further break the LINCS L1000 into two data partitions: 1) all data and 2) high-APC (>0.5) data (see Supplementary Note 5.2).
We chose the first three datasets and pre-processing steps (ADULT, BLOG, and CIFAR10) to match the evaluations performed in previous work [9,8].Similarly, we try to match the respective dataset size (target, source, test) choices made in previous work to provide similar evaluations.
The LINCS L1000 is a widely used biological dataset that suffers from known data quality issues [12,16,14,13,1,17] and removing inaccurate or noisy samples from this dataset could benefit the cancer drug response domain.

Dataset Corruption
To simulate poor data quality, we artificially corrupt datasets in two ways:
• Label Corruption: Endogenous variable (y)
• Feature Corruption: Exogenous variable (x)
Labels are corrupted by randomly relabeling a proportion of the source dataset class labels; for instance, an image of a "dog" might be re-labeled as "cat". The corrupted sample indices are then used as the ground truth of data quality and can be compared to data values. The expectation is that corrupted labels will have lower data values, indicating that they are less valuable to model performance. To summarize the ability of data values to identify corrupted samples, we use the area under the receiver operating characteristic curve (AUROC) metric:

AUROC(c, −ν)
where c is the corrupted label mask (0 = uncorrupted; 1 = corrupted) and ν is the vector of data values. Notably, we flip the sign of the data values, as we expect large data values to indicate high-quality data and small data values to indicate low-quality or mislabeled observations.
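As a worked illustration of this metric, the sketch below computes the AUROC through its rank interpretation: the probability that a randomly chosen corrupted sample receives a higher score than a randomly chosen uncorrupted one, with ties counted as one half. The helper name `auroc` and the toy mask and data values are hypothetical; in practice a library routine would normally be used.

```python
def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs in which the
    positive sample has the higher score; ties count as one half."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: c is the corruption mask, nu the data values (made up).
c = [1, 0, 0, 1, 0, 0]
nu = [-0.8, 0.6, 0.4, 0.1, 0.3, -0.2]

# Negate the data values so that low-valued samples get high scores.
score = auroc(c, [-v for v in nu])
print(score)  # 0.875
```

Here seven of the eight (corrupted, uncorrupted) pairs are ordered correctly, giving 0.875; a value of 0.5 would correspond to randomly assigned data values.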
To explore the ability of data valuation to capture exogenous feature sample quality, we add Gaussian noise to each observation:
$$x^{*}_{i,j} = x_{i,j} + \mathcal{N}(0, \phi_i)$$
where $x^{*}_{i,j}$ is feature $j$ of the corrupted sample $i$, and $\phi_i$ is an observation-specific noise rate sampled from a uniform distribution. Thus, samples with larger noise rates ($\phi_i$) will have noise with greater variance. The primary evaluation task is to apply data valuation and compare the data values with the sample-specific noise rates. We expect that samples with large noise rates will have small data values, indicating that they are less valuable to model performance. To evaluate performance on this task, we use Spearman correlation [38]. Note that we change the sign of our data values, as we expect low data values to correlate with large noise rates:
$$\rho = \mathrm{Spearman}(\phi, -\nu)$$

3 Results

Label Corruption
To evaluate the ability of data values to capture mislabeled samples, we artificially corrupt labels in three classification datasets: ADULT, BLOG, and CIFAR10. We compare DVGS to several baseline methods:
• Randomly assigned data values (null model)
• Leave-one-out (LOO) [10]
• Truncated Monte-Carlo Data Shapley (dshap) [8]
• Data Valuation with Reinforcement Learning (DVRL) [9]
The Leave-one-out and Data Shapley algorithms are only applied to the ADULT and BLOG datasets due to compute resource constraints.
In all three datasets, we corrupt 20% of the labels. For the ADULT and BLOG datasets, we use 1000 source observations and 400 target observations. For the CIFAR10 dataset, we use 5000 source observations and 2000 target observations. We expect accurate data valuation to assign smaller data values to corrupted samples than to uncorrupted samples, indicating that they are less valuable or useful toward our target predictive task. Additionally, we expect that filtering corrupted labels should improve model performance. In each experiment, we evaluate the ability of data values to 1) identify corrupted labels and 2) modify model performance as measured on a hold-out test set when we filter a proportion of the dataset. In this second task, we evaluate the performance changes when we filter high-value samples (expectation that performance will decrease) versus low-value samples (expectation that performance will improve or be unaffected).
For all three datasets, we use a 2-layer neural network as the learning algorithm and the area under the receiver operating characteristic curve (AUROC) as the performance metric [39]. Each experiment is run at least five times with randomly sampled data subsets and unique weight initialization. Experiments are repeated to ensure stable results across diverse subsets of data and weight initialization.
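The filtering step of this evaluation can be sketched as follows. The helper `filter_by_value` and the toy samples are hypothetical; in the experiments, a model would be retrained on the retained samples and scored on the hold-out test set for each filtered proportion.

```python
def filter_by_value(samples, values, fraction, drop="low"):
    """Remove the given fraction of samples with the lowest (drop="low")
    or highest (drop="high") data values, preserving the original order."""
    if drop not in ("low", "high"):
        raise ValueError("drop must be 'low' or 'high'")
    n_drop = int(round(fraction * len(samples)))
    # Indices sorted so that the samples to be dropped come first.
    order = sorted(range(len(samples)), key=lambda i: values[i],
                   reverse=(drop == "high"))
    keep = sorted(order[n_drop:])
    return [samples[i] for i in keep]


samples = ["a", "b", "c", "d", "e"]
values = [0.5, -0.2, 0.9, 0.1, 0.3]

kept_after_low = filter_by_value(samples, values, 0.4, drop="low")
kept_after_high = filter_by_value(samples, values, 0.4, drop="high")
print(kept_after_low)   # ['a', 'c', 'e']
print(kept_after_high)  # ['b', 'd', 'e']
```

Sweeping `fraction` over a grid and plotting test performance for both settings of `drop` gives the kind of paired curves used to compare valuation methods.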
Figure 2 compares the ability of five data valuation methods to identify corrupt labels. Figure 3 compares the effects of filtering based on data values on performance. In all three datasets, DVGS performs comparably to or better than baseline data valuation methods. DVGS performs particularly well on the CIFAR10 dataset, which may be due to the informative features extracted from a pretrained InceptionNet model [37]. The predictive quality of the data values for the identification of corrupt labels is shown in Table 1. DVGS data values are the most predictive of corrupted labels in all three datasets, as measured by the AUROC score. DVRL often performed comparably to DVGS; however, DVRL convergence was inconsistent and occasionally resulted in a suboptimal policy, as evidenced by the wide confidence intervals of DVRL in Figure 2 and the large standard deviations for CIFAR10 in Table 1. Additionally, we note that DVGS underperforms compared to Data Shapley when characterizing high data values, as seen in relative performance trends when filtering high-value data in Figure 3.

Characterization of Sample Noise
In many domains, input features may be noisy due to measurement error, natural stochasticity, or batch effects, leading to inaccurate sample informativeness. To explore the ability of data valuation to quantify input feature noise, we artificially corrupt exogenous features as described in Section 2. For this task, we evaluate data valuation in supervised (ADULT, BLOG, and CIFAR10) and unsupervised learning (CIFAR10 and LINCS) settings. In the supervised setting, we use architectures and hyperparameters identical to those described in Section 3.1. In unsupervised settings, we use an autoencoder architecture [40,41] to create a low-dimensional representation and optimize using reconstruction mean squared error (MSE). We reason that noisy samples will be more difficult to reconstruct and are likely to be detrimental to performance. For the unsupervised setting, we apply our methods to two datasets: the CIFAR10 dataset and a high-quality subset of the LINCS L1000 (footnote 2). The ability of the data values to characterize the exogenous feature noise rates is reported in Table 2. Compared to baseline methods, DVGS produces data values that most strongly correlate (footnote 3) with ground-truth noise rates. As in Section 3.1, we also evaluate the performance impact of filtering data based on data values, and these results are shown in Figure 4. We find that DVGS can most effectively characterize noise rates across all datasets. Additionally, when we compare model performance improvements when low-value data are removed, as shown by the solid lines in Figure 4, we find that the performance of the DVGS method is comparable to or better than the baseline methods.
As observed in the results of the supervised setting, we find that Data Shapley outperforms DVGS in quantifying high-quality data, measured by model performance decrease when filtering high-value data in both the ADULT and BLOG datasets, shown in Figure 4 (a,b).In some of the learning tasks listed in Table 2 only one or none of the baseline methods are calculated due to compute limitations.
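The feature-noise evaluation described in Section 2 can be illustrated with a short sketch. It assumes that the noise rate is used as the standard deviation of the Gaussian noise and that noise rates are drawn from a uniform distribution on [0, max_noise]; both are assumptions of this sketch, as are the helper names and the toy data values.

```python
import random
from math import sqrt


def corrupt_features(x, max_noise, rng):
    """Add zero-mean Gaussian noise to every feature of every sample.
    Assumption: phi_i ~ U(0, max_noise) is used as the standard deviation."""
    phis = [rng.uniform(0.0, max_noise) for _ in x]
    noisy = [[v + rng.gauss(0.0, phi) for v in row]
             for row, phi in zip(x, phis)]
    return noisy, phis


def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    return cov / sqrt(sum((p - ma) ** 2 for p in a) *
                      sum((q - mb) ** 2 for q in b))


def spearman(a, b):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(a), ranks(b))


rng = random.Random(0)
clean = [[rng.uniform(-1, 1) for _ in range(5)] for _ in range(4)]
noisy, phis = corrupt_features(clean, max_noise=1.0, rng=rng)

# Hypothetical noise rates and data values for four samples.
phi = [0.1, 0.5, 0.3, 0.9]
nu = [0.7, 0.1, 0.4, -0.2]
rho = spearman(phi, [-v for v in nu])
print(rho)
```

In this toy example the data values decrease monotonically with the noise rate, so the negated values are perfectly rank-correlated with the noise rates and rho equals 1 up to floating-point error.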

Computational Complexity
DVGS can be applied to large datasets and complex tasks with markedly lower computational costs than previous data valuation methods, and enables application to new domains and data types. In Table 3, we show the runtime of four data valuation algorithms. On average, DVGS is roughly five times faster than DVRL and more than 100 times faster than truncated Monte-Carlo (TMC) Data Shapley. Compared to DVRL and Data Shapley, which require sequential training of models on different subsets of data, the DVGS method requires training only one model. Furthermore, by computing the gradient similarities every T batches, the DVGS runtime can be reduced by a factor of T. In practice, we find that using values of T between 2 and 5 has a marginal impact on the performance of the data values used for corrupted label discovery. These experiments are described in more detail in Supplementary Section 5.3.

Data Valuation of the LINCS dataset
In this section, we apply our DVGS method to quantify LINCS L1000 sample quality across all chemical perturbations. In each experiment, we randomly sampled a target and a test set (5000 observations each) in two conditions:
• Noisy Target set (all-APC): Target dataset sampled from all available observations.
• Clean Target set (high-APC): Target dataset sampled from high-APC observations (APC > 0.5).
In both configurations, we adjust the target set sampling probabilities so that the target set is balanced by perturbation type.The source set consists of all samples that are not in the target or test sets.See Supplementary 5.2 for more information on APC calculation.
Data valuation of LINCS could be done in a supervised or unsupervised setting; however, we chose to use an unsupervised prediction task for the following reasons:
• Simplicity: Encoding drug, cell line, concentration, and measurement time requires additional overhead and may bias the results toward the encoding method chosen, e.g., encoded by drug targets, cell line expression, etc.
• Imbalanced Dataset: Drug perturbations and cell lines are not equally represented in the LINCS dataset, and this may cause bias toward the over-represented drugs or cell lines. While this is a concern in an unsupervised setting, we reason that removing exogenous variables may help mitigate the issue. Additionally, to further mitigate this concern, we select a target set with more balanced proportions of drug perturbations.
• Noise Quantification: We consider measurement noise to be the primary data quality issue in the LINCS L1000 dataset and would like our data values to characterize sample noise rates. The results from Section 3 indicate that DVGS can effectively quantify sample noise using an unsupervised learning task.
For this task, we use an autoencoder with 2 layers in the encoder and decoder networks and 32 latent channels (embedding dimension). To avoid dependence on a specific target set, we ran the experiment several times (n ≥ 3) using different source, target, and test sets, as well as unique weight initializations. We compare the DVGS data values with the APC metric, proposed by Pham et al., to relate the generated data values to previous LINCS L1000 sample quality metrics. Figure 5 shows the performance comparison between the APC and DVGS data values. In the high-APC and all-APC conditions, we see that DVGS captures low data quality much better than the APC metric. In the all-APC condition, DVGS outperforms APC in capturing high-quality data; however, the DVGS data values and APC perform comparably in the high-APC condition. Additionally, we find that DVGS values and APC values correlate in the high-APC condition (Pearson correlation ∼ 0.84) but not in the all-APC condition (Pearson correlation ∼ -0.05). More specifically, in Figure 5c we see that high APC values are depleted for high data values, suggesting that DVGS data values in the all-APC condition may characterize a different aspect of data quality or usefulness than APC.

Discussion
In this work, we address scalability limitations of current data valuation methods by proposing a fast and robust method to estimate data values. We show that this method performs comparably or better than baseline methods in several tasks, including 1) identifying corrupted labels and 2) characterizing exogenous feature noise. Additionally, we have shown that our method works well to modify model performance when filtering data based on data values, and performs comparably or better than baselines when filtering low-value data. While Data Shapley and DVRL tend to lead to larger decreases in model performance when filtering high-value data, DVGS performs exceptionally well at identifying corrupted labels and noisy samples, especially in vision tasks using pretrained models. DVGS is also, on average, 100 times faster than Data Shapley (TMC) and 5 times faster than DVRL. This improvement in runtime makes DVGS applicable to a wide range of datasets and domains. Additionally, in the reported experiments, DVGS was stable across hyperparameters (see Supplementary note 5.1), data partition, and weight initialization. These characteristics make DVGS convenient and robust for many applications in data cleaning and machine learning.
To show the value of our DVGS method in a real-world scenario and to address data quality issues in a foundational dataset, we apply DVGS to the LINCS L1000 level 5 dataset, which has more than 700k high-dimensional samples. We compare our method with a previous LINCS quality metric, the Average Pearson Correlation (APC), and show that our DVGS-produced data values are better able to modify model performance when filtering based on value. Interestingly, using a target dataset drawn randomly from the dataset (not necessarily high-quality) leads to data values that 1) do not correlate well with APC, and 2) significantly outperform APC as measured on a hold-out test set drawn from the full dataset.

Limitations and Future Directions
Similarly to DVRL, our DVGS method lacks the equitable data value properties proposed by Ghorbani et al. [8,9].
Through the lens of anomaly detection, DVGS can be viewed as a meta-learning algorithm that quantifies the similarity of the source samples to the target dataset and could potentially be used for anomaly detection. Additionally, this perspective may help explain why the DVGS method underperforms compared to baselines in identifying high-value data. For instance, if DVGS data values are considered a metric of similarity to the target set, then it may be that the most "similar" samples are not necessarily the most useful, whereas the most "dissimilar" data are likely erroneous or detrimental. It is therefore important that large data values be treated with caution. This also raises the question: how does DVGS handle redundant (or highly similar) data in either the target or source datasets? Future work should address these concerns and characterize how redundancy can skew or alter DVGS data values.
While DVGS works remarkably well on the evaluations listed in this paper, we recognize that gradient-based learning algorithms are rarely trained on gradients from single samples (e.g., on-line learning). Most optimization algorithms are trained using mini-batches, which implies that any sample's value or usefulness toward a predictive task cannot be considered independently of the other samples. Future work may wish to address this by looking at gradient similarity within mini-batches, or by selecting samples that align mini-batch gradients to the target dataset. One can imagine bi- or multi-modal sample gradients, each of which aligns poorly to a target mini-batch gradient on its own, but which align far more closely once the source samples are averaged in a mini-batch.
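The bi-modal case described above can be illustrated with a small numerical sketch. The gradient vectors below are invented purely for illustration, and cosine similarity is assumed as the similarity function; the names `cosine_similarity`, `target_grad`, and `sample_grads` are hypothetical.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two gradient vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical target mini-batch gradient (2 parameters).
target_grad = np.array([1.0, 0.0])

# Two hypothetical source-sample gradients forming a "bi-modal" pair:
# they share a small component along the target gradient but have
# large, opposing components in the orthogonal direction.
sample_grads = np.array([[0.1, 1.0],
                         [0.1, -1.0]])

# Individually, each sample aligns poorly with the target gradient.
individual = [cosine_similarity(g, target_grad) for g in sample_grads]

# Averaged in a mini-batch, the opposing components cancel and the
# mean gradient aligns closely with the target gradient.
batch = cosine_similarity(sample_grads.mean(axis=0), target_grad)

print(individual, batch)
```

In this toy setting, sample-wise similarity would assign both samples a low value even though their mini-batch gradient points in the same direction as the target gradient.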

DVGS Robustness to Hyperparameters
To test the robustness of the DVGS method with respect to algorithm hyperparameters, we performed a grid search on the ADULT dataset with 20% corrupted endogenous labels. We record the ability of DVGS to identify the corrupted labels across all tested hyperparameters. Figure 6 shows the cumulative distribution function (CDF) of the resulting AUROC values across all hyperparameters tested. Note that the AUROC metric characterizes the ability of data values to classify corrupted labels. We find that almost 85% of the tested hyperparameter configurations resulted in performance within 25% of the maximum performance, and more than 50% of the tested configurations resulted in performance within 10% of the maximum performance, indicating that the DVGS method is robust to the choice of hyperparameters. The hyperparameter grid search configurations are shown in Table 4.
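As a rough sketch of how such a summary could be computed, the snippet below scores corrupted-label detection with a rank-based AUROC of the negated data values and then reports the fraction of configurations within a tolerance of the best one. The function names are hypothetical, and "within 10% of the maximum" is interpreted here as a relative threshold, which is an assumption about the definition used.

```python
import numpy as np

def auroc(labels, scores):
    """Rank-based AUROC (Mann-Whitney U statistic); ties get average ranks."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(scores, kind="mergesort")
    ranks = np.empty(len(scores), dtype=float)
    ranks[order] = np.arange(1, len(scores) + 1)
    for value in np.unique(scores):
        tied = scores == value
        ranks[tied] = ranks[tied].mean()
    n_pos = int(labels.sum())
    n_neg = len(labels) - n_pos
    u = ranks[labels].sum() - n_pos * (n_pos + 1) / 2
    return float(u / (n_pos * n_neg))

def fraction_within(aurocs, tolerance):
    """Fraction of configurations with AUROC >= (1 - tolerance) * max AUROC.

    Assumes a relative tolerance; an absolute tolerance is an equally
    plausible reading of the text.
    """
    aurocs = np.asarray(aurocs, dtype=float)
    return float(np.mean(aurocs >= (1.0 - tolerance) * aurocs.max()))

# Toy example: corrupted samples (1) should receive low data values,
# so the negated values are used as the detection score.
corrupted = np.array([1, 0, 0, 1, 0])
data_values = np.array([-0.8, 0.5, 0.3, -0.1, 0.2])
detection_auroc = auroc(corrupted, -data_values)

# Hypothetical AUROC results from four hyperparameter configurations.
share = fraction_within([0.90, 0.85, 0.70, 0.50], tolerance=0.10)
```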

Average Pearson Correlation (APC) metric
We compute the previously proposed Average Pearson Correlation (APC) [17] of LINCS level 4 replicates using the following procedure. For a given level 5 LINCS sample:
• Identify the level 4 bio-replicate sample IDs that were used to generate the level 5 aggregate sample.
• Load the level 4 expression profile of each sample ID into memory.
• Filter to select only the landmark genes (978).
• Compute the average pairwise Pearson correlation of the level 4 bio-replicates.
As shown in Figure 7, the resulting APC distribution is skewed right, with the majority of samples having an APC less than 0.5, suggesting that most of the replicates are highly discordant. Notably, future work may wish to perform data valuation directly on the level 4 samples, which may enable researchers to "rescue" high-quality replicates, even if the replicates are highly discordant.
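The last two steps of the procedure above can be sketched as follows, assuming the level 4 replicate profiles have already been loaded into a NumPy array. The function name, the array layout (replicates in rows, genes in columns), and the `landmark_idx` argument are illustrative assumptions, not part of the LINCS tooling.

```python
import numpy as np

def average_pearson_correlation(level4_profiles, landmark_idx):
    """Average pairwise Pearson correlation between bio-replicates.

    level4_profiles: array of shape (n_replicates, n_genes).
    landmark_idx: indices of the landmark-gene columns to keep.
    Returns NaN when fewer than two replicates are available.
    """
    profiles = np.asarray(level4_profiles, dtype=float)[:, landmark_idx]
    n_replicates = profiles.shape[0]
    if n_replicates < 2:
        return float("nan")
    # np.corrcoef treats each row as one variable (one replicate).
    corr = np.corrcoef(profiles)
    # Keep each unordered replicate pair once (strict upper triangle).
    upper = np.triu_indices(n_replicates, k=1)
    return float(corr[upper].mean())

# Toy example with three replicates and six genes; the last gene is
# treated as a non-landmark gene and excluded.
x = np.arange(5.0)
replicates = np.array([
    np.append(x, 100.0),
    np.append(2 * x + 1, -50.0),
    np.append(-x, 7.0),
])
apc = average_pearson_correlation(replicates, landmark_idx=np.arange(5))
```

In the toy example, the first two replicates are perfectly correlated and the third is perfectly anti-correlated with both, so the three pairwise correlations average to -1/3.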

Additional Runtime Experiments
In Figure 8 we show the experimental results of DVGS as the number of source samples increases. As expected, DVGS runtime scales linearly with the number of source samples divided by the period of gradient computations (T). In Figure 8b we show the ability of DVGS to classify corrupted labels as we increase the value of T. As one would expect, the AUROC value decreases with larger T; however, the marginal decrease in performance may be worthwhile for the improvements in runtime, especially on large datasets. When applying our method to the LINCS dataset, we were able to run 500 epochs of DVGS on 710,216 source samples using a multilayer autoencoder neural network (more than 650k parameters) in roughly 8 hours on an NVIDIA 3090 GPU.
The memory requirement of the DVGS method is in many ways comparable to classical SGD optimization problems; however, the computation of high-dimensional sample-wise gradients can increase the memory requirements. Therefore, as the number of model parameters increases, the memory footprint of the sample gradients will also increase. To mitigate this issue, we chose to compute sample gradients in mini-batches, whose size can be manually specified to fit a given task. Reducing the source batch size will therefore reduce the memory footprint, but lead to a small increase in computation time.
Additionally, the user can choose a subset of the model parameters to use for gradient computation, which will reduce memory overhead.
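A minimal sketch of these two memory-saving options is given below for a linear model with squared-error loss, where per-sample gradients have a closed form. The model, the cosine similarity function, and all names (`per_sample_grads`, `chunked_similarities`, `param_idx`) are assumptions made for illustration rather than the implementation used in our experiments.

```python
import numpy as np

def per_sample_grads(w, X, y):
    """Per-sample gradients of 0.5 * (x @ w - y)**2; shape (n, d)."""
    residual = X @ w - y
    return residual[:, None] * X

def chunked_similarities(w, X_src, y_src, target_grad, batch_size,
                         param_idx=None):
    """Cosine similarity of each source-sample gradient to the target
    gradient, computed one source mini-batch at a time.

    Only `batch_size` sample gradients are held in memory at once, and
    `param_idx` optionally restricts the comparison to a parameter subset.
    """
    if param_idx is None:
        param_idx = np.arange(len(w))
    tg = target_grad[param_idx]
    tg_norm = np.linalg.norm(tg)
    sims = np.empty(len(X_src))
    for start in range(0, len(X_src), batch_size):
        stop = start + batch_size
        g = per_sample_grads(w, X_src[start:stop], y_src[start:stop])
        g = g[:, param_idx]
        denom = np.linalg.norm(g, axis=1) * tg_norm
        sims[start:stop] = (g @ tg) / np.maximum(denom, 1e-12)
    return sims

# Toy usage with random data and a random target gradient.
rng = np.random.default_rng(0)
X_src = rng.normal(size=(10, 4))
y_src = rng.normal(size=10)
w = rng.normal(size=4)
target_grad = rng.normal(size=4)

small_batches = chunked_similarities(w, X_src, y_src, target_grad, batch_size=3)
one_batch = chunked_similarities(w, X_src, y_src, target_grad, batch_size=10)
subset = chunked_similarities(w, X_src, y_src, target_grad, batch_size=3,
                              param_idx=np.array([0, 2]))
```

The batch size changes only the peak memory use, not the resulting similarities, whereas restricting the parameter subset changes the similarities themselves and should be treated as a modeling choice.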

Figure 1: We propose a method of data valuation that compares each source sample to the target samples by computing the similarity of gradients during stochastic gradient descent. In panel A, we depict a toy example of a 1-d loss landscape. Sample 1 (red) is accurately labeled (high quality), whereas sample 2 (blue) is incorrectly labeled (low quality). In panel B, we plot the similarity of each source sample gradient compared to the target set gradient (black solid line in panel A). Panel C shows the marginal distribution of gradient similarities, which is averaged to obtain the final source sample data value. To make this process tractable, gradient similarities are computed over a limited number of model parameter values during traditional stochastic gradient descent. The computed gradients are visualized by dotted lines in panels A, B, and C ($w_0, w_1, \ldots, w_3$). To choose the relevant values of θ, we use stochastic gradient descent (SGD), with gradients calculated from the target set.

14: $\theta_{i+1} \leftarrow \theta_i - \alpha \nabla L^{\text{target}}_i$ ▷ update model parameters using the target batch gradient
15: end for
16: for $k = 0, 1, \ldots, N_{\text{source}}$ do
17: $\nu_k \leftarrow \frac{1}{N_{\text{iter}}} \sum_{i=0}^{N_{\text{iter}}} \nu_k^i$ ▷ compute the average gradient similarity for each source sample
18: end for

Future approaches should explore the comparison of within-class gradient similarities, which may mitigate this problem without class balancing.
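The parameter update and averaging steps of the algorithm can be sketched end to end as follows. This is a minimal illustration under several assumptions: a logistic regression model with closed-form per-sample gradients, cosine similarity as the gradient similarity, and source similarities computed at every iteration (T = 1). The function names and default hyperparameters are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sample_grads(theta, X, y):
    """Per-sample gradients of the logistic loss; shape (n, d)."""
    return (sigmoid(X @ theta) - y)[:, None] * X

def cosine_rows(G, g):
    """Cosine similarity between each row of G and the vector g."""
    denom = np.linalg.norm(G, axis=1) * np.linalg.norm(g)
    return (G @ g) / np.maximum(denom, 1e-12)

def dvgs_values(X_src, y_src, X_tgt, y_tgt, n_iter=200, lr=0.1,
                batch_size=32, seed=0):
    """Average similarity of each source-sample gradient to the target
    mini-batch gradient along an SGD trajectory driven by the target set."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X_src.shape[1])
    sims = np.zeros(len(X_src))
    for _ in range(n_iter):
        idx = rng.choice(len(X_tgt), size=min(batch_size, len(X_tgt)),
                         replace=False)
        target_grad = sample_grads(theta, X_tgt[idx], y_tgt[idx]).mean(axis=0)
        # Similarity of every source sample at the current parameters.
        sims += cosine_rows(sample_grads(theta, X_src, y_src), target_grad)
        # Update parameters using the target batch gradient only.
        theta = theta - lr * target_grad
    # Data value: average similarity over all iterations.
    return sims / n_iter
```

Source samples whose gradients consistently oppose the target gradient, such as samples with flipped labels, are expected to receive lower values than samples that agree with the target set.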

Figure 2: Evaluation of the respective data valuation methods' ability to identify corrupted labels. The gray dashed "random" lines show theoretical random performance, whereas the blue/cyan "random" lines show empirically measured random values.

Figure 4: Evaluation of the respective data valuation methods' ability to impact model performance when filtering either high-value (dashed lines) or low-value (solid lines) data. The y-axis measures the model performance using the AUROC metric.
Figure 5: (a-b) The reconstruction performance ($R^2$) of autoencoders applied to the LINCS L1000 data when filtering low- and high-value data. (c-d) DVGS data values compared to APC values. Panels: (a) All-APC target set. (b) High-APC target set. (c) All-APC target set. (d) High-APC target set.

Figure 6: The cumulative distribution function (CDF) of AUROC($c_i$, $-\nu_i$) across all tested hyperparameters, where $\nu_i$ are the data values generated by DVGS and $c_i$ are the corrupted-label indicators. The red dashed line marks the threshold above which AUROC values are within 10% of the maximum AUROC value (e.g., roughly 55% of all tested hyperparameters resulted in an AUROC value within 10% of the maximum AUROC).

Figure 7: The Average Pearson Correlation (APC) distribution of level 5 LINCS samples.
Figure 8: The scalability and performance of the DVGS method as a function of the number of source samples and the period of source similarity computations (T). (a) DVGS runtime on the ADULT dataset when computing gradient similarities every T steps. (b) Ability of DVGS to identify corrupted labels with different values of T (period of source gradient computations).
[33]s, then the source samples with the negative class may be particularly dissimilar, even if they are valuable to the optimization process. To avoid inadvertent bias of class-based data values, we suggest balancing class weights [33] when computing target gradients.

Algorithm 1
1: for $i = 0, 1, \ldots, N_{\text{iter}}$ do

Table 3: Average runtime (in minutes) of 8 experiments. Experiments 1-3 were for label corruption; Experiments 4-6 were for noise characterization; Experiments 7 and 8 were unsupervised characterization of noise.
Because DVGS does not satisfy the data value properties proposed by Ghorbani et al. [8,9], its data values should not be interpreted in the same way as Data Shapley values; DVGS data values do not have a comparably convenient interpretation. Rather, DVGS data values should be considered latent variables characterizing data usefulness, and we make no assumption about the linearity or magnitude of DVGS data values. These traits suggest that DVGS data values should be treated contextually as an ordered list of valuable samples. Pragmatically, ranked sample values meet the requirements of many of the evaluation techniques used by previous data valuation methods [8,9], including identifying corrupted labels and noise quantification. Future directions may consider learning a task-specific function to estimate Data Shapley values from DVGS data values, which would allow users to interpret the DVGS data values in a way comparable to Data Shapley. This could be done by performing DVGS data valuation and calculating a limited number of Data Shapley values, which could then be used as a training set to infer Data Shapley values from DVGS values. Such an approach may help merge the scalability advantages of DVGS with the interpretability of Data Shapley.

Table 4: The DVGS hyperparameter configurations tested in a grid search with 2 replicates per configuration.